Abstract: Systems that exploit publicly available user-generated
content such as Twitter messages have been successful
in tracking seasonal influenza. We developed a novel filtering
method for Influenza-Like-Illness (ILI)-related messages using
587 million messages from Twitter micro-blogs. We first filtered 
messages based on syndrome keywords from the BioCaster 
Ontology, an extant knowledge model of laymen's terms. We 
then filtered the messages according to semantic features such as 
negation, hashtags, emoticons, humor and geography. The data 
covered 36 weeks for the US 2009 influenza season from 30th 
August 2009 to 8th May 2010. Results showed that our system 
achieved the highest Pearson correlation coefficient of 98.46% 
(p-value<2.2e-16), an improvement of 3.98% over the previous 
state-of-the-art method. The results indicate that simple NLP- 
based enhancements to existing approaches to mine Twitter data 
can increase the value of this inexpensive resource. 

Index Terms: Twitter; natural language processing; influenza;
social media

I. Introduction 

The Internet has already proved to be an important re- 
source in surveillance systems for tracking infectious disease 
outbreaks, showing that there is an opportunity for low cost 
time-sensitive sources to be exploited to supplement existing 
traditional surveillance systems. Twitter is a social
networking and micro-blogging service that
allows its users to send and read messages, called tweets.
Tweets are text-based posts of up to 140 characters. By
February 2012, Twitter was reported to have over 500 million
active users, generating over 430 million tweets and handling
over 1.6 billion search queries per day [1], [2]. Earlier work
showed that self-reports in tweets could be used to predict
the 2009 A(H1N1) swine flu pandemic [3]. Recently, Collier
et al. [4] analyzed self-protective behavior reports in Twitter
and showed that there was a moderately strong Spearman's
rho correlation between these reports and WHO/NREVSS
laboratory data for A(H1N1) in the USA during the later part
of the 2009-2010 influenza season. Twitter was used to track
the Influenza-Like-Illness (ILI) rate in two recent studies
by Lampos and Cristianini [5] and Culotta [7]. The key
idea in these two studies was to choose keywords to filter and
aggregate influenza-related messages. Lampos and Cristianini
chose 73 keywords from 1,560 flu-related terms for ILI
tracking in the United Kingdom (UK) and compared to Health 
Protection Agency lab data with a correlation coefficient of 
97% [5]. Culotta [7] selected only four ILI-related keywords,
i.e., "flu", "cough", "headache", "sore throat" and reported a
high correlation coefficient of 95%. More recently, Twitter
was reported not only as a tool for calculating the ILI rate 
but for tracking public concerns about health-related events. 
Signorini et al. used Twitter to track levels of disease activity
and public concern in the U.S. during the influenza A H1N1
pandemic [8]. In order to estimate the ILI rate, they first
collected Twitter data within the U.S. containing keywords
"swine", "flu", "influenza" or "h1n1" and then built an ILI
estimation model using a support vector machine (SVM). The
estimates tracked the reported ILI rates closely, with an average error of
0.28% for national weekly ILI levels and 0.37% for regional 
weekly ILI levels. In order to track public concerns, they 
added more public concern keywords such as "travel", "trip", 
"flight" (for disease transmission) or "wash", "hand", "hy- 
giene" (for disease countermeasures) or "guillain", "infection" 
(for vaccine side effects). Then they calculated the percentage 
of observed tweets (tweets including public concern keywords) 
over influenza-related tweets. The results showed that Twitter 
messages can be used as a measure of public interest or 
concern about health-related events. Chew and Eysenbach [9]
collected Twitter messages containing keywords "swine flu",
"swineflu" and "H1N1" and monitored the use of the terms
"H1N1" versus "swine flu" during the 2009 H1N1 outbreak
and validated Twitter as a tracking system for public attention.
They showed that several Twitter activity peaks coincided
with major news stories and the results correlated well with
H1N1 incidence data. For both these studies, keywords play
an important role in analyzing the content of tweets. 

These related methods combine heuristics and experimenta- 
tion and raise two important questions: (1) Is there a systematic 
method to choose high performance keywords for disease 
tracking? (2) Can richer semantic information contained in 
Twitter messages such as negation, hashtags or emoticons 
help improve disease tracking? We investigated a generic 
approach to filter tweets using two steps. In the first step, we
filtered messages using keywords derived from the BioCaster
ontology, which is a multilingual public health terminology
designed for event surveillance from news media [10]. In
the second step, we filtered these messages using seman- 
tic features. Given the methodology for choosing keywords 
from the ontology, we called the first step knowledge-based 
filtering and the second semantic-based filtering. We chose 
ILI as a useful example of how to pursue answers to these 
questions. We systematically investigated the relationship between
Twitter messages and the ILI rate. Seasonal influenza
(SI) is of particular public health concern since it results
in about three to five million cases of severe illness in the
worldwide population, causes 250,000-500,000 deaths per year
[11] and uses substantial hospital resources. At the same
time, there is a need to strengthen influenza surveillance in
order to help public health professionals and governments
quickly identify the signature of novel pandemic strains like
2009 A(H1N1). Well-known international and national
influenza surveillance systems include the WHO Global
Influenza Surveillance Network (FluNet) [12], the US Outpatient
Influenza-like Illness Surveillance Network (ILINet)
[13], and the European Influenza Surveillance
Network (EISN) [14]. Such systems, which rely heavily on
data resources collected from hospitals and laboratories, have
high specificity but often lag 1-2 weeks behind because data
need to be collected, processed and verified manually by health
professionals [15]. Additionally, these systems have high costs
for set up and maintenance. There are several studies targeted 
at using Internet resources for predicting influenza epidemics. 
Event-based systems, for example, are now being widely used
for detecting and tracking infectious diseases, and new-type
influenza in particular. Additionally, various studies have
used pre-diagnostic signals in self-reports or user searches to
track the ILI rate. There is currently no standard definition
of ILI but it is generally recognized as fever with symptoms
of upper or lower respiratory tract infection. Although many
reports of ILI will not actually turn out to be influenza, ILI
tracking has been shown to correlate well with diagnostic data
for both SI and A(H1N1). Chew used user clicks on search
keywords such as "flu" or "flu symptoms" in advertisements
and showed a Pearson correlation of 91% with the ILI rate
in Canada [17]. Polgreen et al. [18] showed that a set of
queries containing terms "flu" or "influenza" in the Yahoo! 
search engine correlated closely with virological and mortality 
surveillance data over multiple years. Similarly, Ginsberg et
al. [19] developed the widely used system Google Flu Trends
(http://www.google.org/flutrends/), which uses query logs from
users in the Google search engine and reported a high Pearson 
correlation coefficient of 97%. 

The distinction between this study and previous work rests 
on two important aspects: (1) we propose here a knowledge- 
based method to filter tweets based on an extant and publicly 
available ontology, and (2) we analyzed the role of a wide 
range of semantic features in relation to the ILI rate using 
simple natural language processing (NLP) techniques. 

Fig. 1. Number of Twitter messages per week for 36 weeks. Week 1 ends
on September 5th, 2009, week 36 ends on May 8th, 2010. 

TABLE I

Statistics for the Twitter corpus for 36 weeks. Week 1 ends
on September 5th, 2009, and week 36 ends on May 8th, 2010; the
corpus totals 243 GB in compressed files (gz format). URL means a
Twitter message containing a URL beginning with "http";
Hashtags means Twitter messages containing tokens starting
with the "#" symbol.

Name            Unique         Total
Tweets          -              587,290,394
Users           23,571,765     -
URL             -              136,034,309
Hashtags (#)    7,963,728      96,399,587


II. Methods 

A. Data Sets 

For compatibility and comparison with previous studies, we
obtained Culotta's data set from Twitter for 36 weeks,
from 30th August 2009 to 8th May 2010 [7]. These data were
originally collected as part of the work of [20], in which a
strong correlation (80%) was revealed between certain Twitter
messages and consumer confidence and political opinion polls
in the US using keyword filtering. The obtained data set was
selected by "Gardenhose" real-time stream sampling. The total
number of tweets over these 36 weeks is over 587 million,
from about 24.5 million unique users. The size of the
corpus is about 243 GB in compressed format (.gz files).
Characteristics of the Twitter corpus are summarized in Table I
and the weekly number of tweets is shown in Figure 1.

Data for the ILI rate from the CDC's U.S. Outpatient 
Influenza-like Illness Surveillance Network (ILINet) was considered
as the gold standard [13]. ILINet consists of more than
3,000 healthcare providers in all states of the US, reporting 
over 25 million patient visits each year. The CDC publishes 
national and regional ILI rates based on weekly reports from 
approximately 1,800 outpatient care sites around the US. In 
this article, the ILI rate for the US was considered as the gold 
standard so that results could be compared to those in [7].



B. Keyword-based Filtering 



TABLE III 

List of extra keywords added for respiratory syndrome. 



1) Empirical approach: We first used Culotta's method
as described in [7]. This method used four keywords, "flu",
"cough", "headache" and "sore throat", and was reported to have
achieved the highest Pearson correlation coefficient of 95%.
We called it Culotta4 and refer to it as the baseline for
comparison. The reason for choosing these four keywords in
Culotta's work is that all of them relate to the CDC's ILI definition,
which is "fever (temperature of 100 F [37.8 C] or greater)
and a cough and/or a sore throat in the absence of a known
cause other than influenza" [13]. The keyword "fever" was not
chosen since it is highly metaphoric and contained in many
irrelevant messages such as "I've got Bieber fever", which
referred to the pop star Justin Bieber [7]. In addition we used
two other keyword-based filtering methods: one is described
by Signorini et al. [8], which used four keywords "h1n1",
"swine", "flu", "influenza", and the other is described by Chew
[9], which used three keywords "h1n1", "swineflu", "swine
flu". We called the former Signorini4 and the latter Chew3.
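
The three empirical keyword filters can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names are hypothetical, and whole-word, case-insensitive matching is an assumption (the text does not say whether substring or token matching was used).

```python
import re

# Keyword sets as listed in the text; names of the variables are illustrative.
KEYWORD_SETS = {
    "Culotta4": ["flu", "cough", "headache", "sore throat"],
    "Signorini4": ["h1n1", "swine", "flu", "influenza"],
    "Chew3": ["h1n1", "swineflu", "swine flu"],
}


def compile_keywords(keywords):
    """Build one case-insensitive pattern matching any keyword as a whole word or phrase."""
    # Longest keywords first so that multi-word phrases are tried before their parts.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def keyword_filter(tweets, keywords):
    """Keep tweets that contain at least one keyword (assumed matching rule)."""
    pattern = compile_keywords(keywords)
    return [t for t in tweets if pattern.search(t)]


sample = [
    "Down with a flu. Sore throat.",
    "Has a wicked headache, as well as a new bill to pay",
    "Worried about H1N1 at school",
    "Lovely weather today",
]
culotta4 = keyword_filter(sample, KEYWORD_SETS["Culotta4"])
chew3 = keyword_filter(sample, KEYWORD_SETS["Chew3"])
```

Note that the second sample tweet passes Culotta4 through "headache" alone, which is the kind of false positive the knowledge-based approach aims to reduce.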

While applying keyword filtering is simple, choosing the
optimal set of keywords is not trivial. Previous methods were
based on a try-and-test strategy, and hence would be difficult
to generalize to other diseases. Although our method also
relies on expert judgement, it approaches the problem in
a systematic manner. In previous work, first, a set of candidate
keywords or queries is selected from a list of pre-defined
keywords; second, models are built based on these candidates
and compared; finally, the top candidates are chosen based on
the highest level of correlation [5], [18], [19]. These methods,
in our opinion, are complex and require a lot of computation
time. Flu Detector chose 73 keywords from 1,560 flu-related
terms on Twitter data [5]. Google Flu Trends (GFT) tested
each query among over 50 million candidates and finally
established the top 45 query terms [19]. Recently, Cook et al.
[21] proposed an updated model for GFT and showed that it
performed better than GFT during the 2009 H1N1 pandemic.
The main difference between these two models is the choice
of query terms. The updated model included approximately 
160 query terms compared with 45 in the original GFT, with 
some potential overfitting. The terms in the updated model are 
more directly related to influenza, especially terms related to 
influenza symptoms. For example, queries in the categories 
"general influenza symptoms" and "specific influenza symp- 
toms" comprise 69% of the updated model volume, and 72% 
of the updated model queries contain the word "flu". This
result suggested that choosing more appropriate keywords
may improve the accuracy of the estimated ILI rate.
Moreover, keywords related to influenza symptoms or "flu" are
important for estimating the ILI rate. We propose an alternative
approach, not previously tried for this task context, that uses 
syndrome related terms from an extant ontology. We call this 
the knowledge-based approach. 

2) Knowledge-based approach: We started from version 3
of the BioCaster Ontology (BCO)
(http://www.code.google.com/p/biocaster-ontology/),
which is part of the BioCaster
project developed by an international team from the
National Institute of Informatics, Japan [22]. The BCO contains
domain terms in public health such as diseases, agents,
symptoms, syndromes and species with a focus on laymen's
terms that appear in newswire. Syndromes are classified into
six categories: dermatological, gastrointestinal, hemorrhagic,
musculoskeletal, neurological, and respiratory syndromes. The
respiratory syndrome contains 37 symptom keywords as
shown in Table II. The simplest method of employing this
knowledge is to use all these syndrome keywords for filtering,
i.e., we keep all tweets which contain at least one keyword
in Table II. This method is called Syndrome. We noticed that
the respiratory syndrome did not include the keyword "flu",
which was shown to be an important indicator in the empirical
approach since it directly grounds message content to ILI [5],
[7]. Thus, we added the keyword "flu" to the Syndrome
method and called it Syndrome + "flu".

In addition to the technical terms in Table II, we investigated
and listed extra terms in Table III. These extend the terms
in Table II and are often used in the everyday sublanguage of
Twitter messages. These informal terms could be mapped to
some ontology concepts. The method that uses all keywords in
both Table II and Table III for filtering is called Syndrome
+ Extra terms. The purpose of adding more keywords was
to model more closely the types of informal language
used in tweets to talk about influenza.
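
A minimal Python sketch of the Syndrome and Syndrome + "flu" filters follows. It reads Syndrome + "flu" as a two-step filter (a respiratory syndrome keyword first, then the keyword "flu"), which is how the Discussion describes it. The term list is a stand-in: only "sore throat" is confirmed by the text as a BCO respiratory keyword, the other entries are hypothetical placeholders for the 37 terms of Table II, and simple substring matching is an assumption.

```python
# Stand-in for the 37 respiratory syndrome keywords of Table II.
# Only "sore throat" is confirmed by the text; the others are hypothetical.
RESPIRATORY_TERMS = ["sore throat", "cough", "shortness of breath"]


def contains_any(text, terms):
    """True if any term occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def syndrome(tweets):
    """Step 1: keep tweets with at least one respiratory syndrome keyword."""
    return [t for t in tweets if contains_any(t, RESPIRATORY_TERMS)]


def syndrome_plus_flu(tweets):
    """Step 2: among syndrome tweets, keep those that also mention "flu"."""
    return [t for t in syndrome(tweets) if "flu" in t.lower()]


sample = [
    "Down with a flu. Sore throat. Its burning like hale",
    "bad cough all night",
    "Got my flu shot today",
]
step1 = syndrome(sample)
step2 = syndrome_plus_flu(sample)
```

Requiring both conditions explains why the second step keeps far fewer tweets than the first.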

Before detailing the semantic features employed in our 
study, we briefly review related work on categorizing tweets. 
In an earlier survey, [23] categorized general blogs based
on user motivations into five groups: (1) to document users'
lives, (2) to provide commentary and opinions, (3) to express
deeply felt emotions, (4) to articulate ideas through writing,
and (5) to form and maintain community forums. Since Twitter
messages are micro-blogs, which are limited to 140 characters,
they also have their own unique characteristics. From the
viewpoint of user intentions, [24] observed that tweets can be



TABLE II 

List of keywords used for respiratory syndrome from the BioCaster Ontology [10].

Fig. 2. A general schema for filtering Twitter data.



classified into four groups: (1) daily chatter, (2) conversations,
(3) sharing information or URLs, and (4) reporting news. The
first two categories in the study
of [24] can be grouped as individual reports, and the remaining two as public
reports or comments on external events. We separated them
into these two groups because (1) the ILI rate depends on
the number of specific ILI cases, i.e., individual reports, and
(2) the two groups can be easily distinguished by the
presence of a URL. We assumed that individual reports do
not usually contain a URL due to the lack of space, whereas
comments on external events often contain URLs within their
text. Examples of tweets about external events include:
"7-year-old boy dies of flu, pneumonia <URL>" or "Major
swine flu symptoms:Sore throat, prod, cough, runny nose,
fever, headache, lethargy - <URL> ". We called the set of individual
reports obtained by removing messages with
URLs Syndrome+"flu"-URL.
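
The URL-based separation of individual reports from comments on external events can be sketched as below. This is an illustrative sketch: the function names are hypothetical, and treating any token that begins with "http" as a URL follows the definition given in the Table I caption.

```python
import re

# A URL is taken to be any token beginning with "http" (per the Table I caption).
URL_PATTERN = re.compile(r"(?<!\w)http\S*", re.IGNORECASE)


def has_url(tweet):
    """True if the tweet contains a token starting with "http"."""
    return URL_PATTERN.search(tweet) is not None


def individual_reports(tweets):
    """Keep tweets without URLs, assumed here to be individual reports."""
    return [t for t in tweets if not has_url(t)]


sample = [
    "7-year-old boy dies of flu, pneumonia http://example.com/news",
    "bad cough bad flu",
]
kept = individual_reports(sample)
```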

C. Semantic-based Filtering 

In order to identify useful features that might help filter 
Twitter messages, we chose five candidate semantic features: 
negation, hashtags, emoticons, humor, and geography. Below
we explain the reasons for these choices and the way we
extracted them from the Twitter messages. Since the ILI rate
depends largely on identifying individual cases, we explored
those features after filtering for individual reports. A general
schema for filtering in this paper is depicted in Figure 2.

We first discuss the use of negation in tweets, which has
a wide range of linguistic functions in English. In general,
determining the scope of negation is quite complex and
challenging for linguistics in general and sentiment analysis in
particular [25]. We were motivated by the simple observation
that a negative sentence about users who are not
infected by ILI usually has a negation before the term "flu".
For example, "It's not about having flu" or "it's not swine
flu" are negative sentences, but the sentence "not feeling well,
got flu and cough" is not a negative sentence in this regard.
Two processing steps are necessary: (1) breaking tweets into
individual sentences, and (2) parsing each sentence and identifying
the negation. To this end, we employed the RASP
grammatical parser developed at the Computer Laboratory
of Cambridge University [26]. The RASP parser works in
a pipeline as follows. First, it takes raw text as the input and
runs a sentence boundary detection program to break it into
multiple sentences; second, it runs a tokenization program,
and each token is then tagged with one of 150 part-of-speech
and punctuation labels derived from the CLAWS tagset [27].
Finally, the parser uses a generalized parser to choose the
most probable parse tree based on probabilistic ranking.
The parse tree is then converted into grammatical relations
that determine the relationships between subjects and objects in a
sentence. We identify negation as follows. First, we identify
whether the tag for negation (named "XX" in the
CLAWS tagset) is present in a sentence containing the term
"flu". Second, we apply a simple rule: if there is a direct
or indirect grammatical relationship between the negation and
the term "flu", then we remove that tweet. The idea behind
this rule is that tweets containing text such as "it's not swine
flu" or "I don't have flu" will be removed. We called this filtering method
Negation. A scheme and an example for detecting negation are
depicted in Figure 3.
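
The negation rule relies on RASP grammatical relations, which are not reproduced here. As a rough illustration only, the sketch below swaps in a much cruder heuristic: a tweet is treated as negated if a negation cue precedes "flu" within the same punctuation-delimited clause. The cue list, the clause splitting and all function names are assumptions, and this stand-in will disagree with a real parser on many sentences.

```python
import re


def is_negation_cue(token):
    """Crude cue test (assumption): 'not', 'no', 'never', or a contraction ending in n't."""
    return token in ("not", "no", "never") or token.endswith("n't")


def clause_negates_flu(clause):
    """True if a negation cue appears before the token 'flu' in this clause."""
    tokens = re.findall(r"[a-z']+", clause.lower())
    if "flu" not in tokens:
        return False
    flu_index = tokens.index("flu")
    return any(is_negation_cue(tok) for tok in tokens[:flu_index])


def is_negated(tweet):
    """Split on sentence and clause punctuation, then test each clause."""
    clauses = re.split(r"[.!?,;]+", tweet)
    return any(clause_negates_flu(c) for c in clauses)


def negation_filter(tweets):
    """Drop tweets in which 'flu' is negated under the heuristic above."""
    return [t for t in tweets if not is_negated(t)]


examples = [
    "It's not about having flu",
    "it's not swine flu",
    "I don't have flu",
    "not feeling well, got flu and cough",
]
kept = negation_filter(examples)
```

On the four example sentences from the text the heuristic gives the intended result, but only because the comma in the last example separates the negation from "flu"; a dependency parser does not need that cue.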

Hashtags and emoticons were investigated previously as part
of the filtering problem and shown to be important indicators
for improving the accuracy of filtering [28]-[30]. A hashtag
is a token beginning with the symbol "#". It is considered
a community-driven convention for adding additional context
and metadata to tweets. Hashtags were developed as a means
to create "groupings" on Twitter, helping users to emphasize
or group important information or topics; therefore they could
be important indicators for the ILI rate. We identified ILI-
related hashtags as ones in any sentence containing respiratory



[Figure 3: Sentence splitter -> Negation detection. Example: "Hey , at least it be+ not+ swine flu !" tagged NP1 , II RR PPH1 VBZ XX NN1 NN1]



Fig. 3. The scheme for detecting negation in tweets and an example. In
the example, "NP1", "NN1", "NN2", "VBZ", "PPH1", "II", "RR", "XX" are
tags defined in the CLAWS tagset [27]; "mod", "ncmod", "ncsubject", "xcomp"
are grammatical relationships defined in the RASP parser [26]. We identify
the negation by the following rule: if there is a direct or indirect grammatical
relationship between the negation tag ("XX") and the term "flu", we remove
the tweet. In this example, there is an indirect relationship between them
through "be+" (VBZ); therefore, this tweet is removed.



syndrome keywords in Table II or the term "flu". For example,
"Still coughing smh #swineflu #h1n1". If a tweet contained a
hashtag that was not related to ILI, we removed it. We called
this filtering based on ILI-related hashtags HashTags.
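
One possible reading of the HashTags rule is sketched below: a hashtag counts as ILI-related when the sentence that carries it also contains a syndrome keyword or "flu", and a tweet is dropped as soon as one sentence carries a hashtag without such a term. The term list is a hypothetical stand-in for Table II, tweets without hashtags are assumed to pass unchanged, and the sentence splitter is a simple regular expression rather than the RASP splitter.

```python
import re

# "flu" plus placeholder syndrome terms; the real list is Table II (not shown here).
ILI_TERMS = ["flu", "cough", "sore throat"]
HASHTAG = re.compile(r"(?<!\w)#\w+")


def sentence_has_ili_term(sentence):
    """True if the sentence mentions any ILI term (substring match, assumption)."""
    lowered = sentence.lower()
    return any(term in lowered for term in ILI_TERMS)


def keep_by_hashtags(tweet):
    """False if some sentence carries a hashtag but no ILI term."""
    for sentence in re.split(r"(?<=[.!?])\s+", tweet):
        if HASHTAG.search(sentence) and not sentence_has_ili_term(sentence):
            return False
    return True


examples = [
    "Still coughing smh #swineflu #h1n1",
    "Got the flu. Watching the game tonight! #football",
    "Down with flu and a sore throat",
]
kept = [t for t in examples if keep_by_hashtags(t)]
```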

Emoticons are another key feature in tweets that express the
mood of users, e.g., anxiety, anger, happiness. We used a list of
emoticons from Wikipedia [31] to identify "smiley" emotions
of users in sentences. These are emoticons for a smile or laugh,
hug, love, or heart, or groups of characters such as ":-)", ":)",
":D". For example, "Glad to hear that you're beating the flu.
:-) Hope you don't get the nasty cough that everyone's getting
this year". Tweets with such smiley icons might not be relevant to ILI. If
a tweet contained emoticons with smiley or laugh, hug, love,
heart, or shade icons, it was filtered out. We called this method
Emoticon.
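
A minimal sketch of the Emoticon filter is given below. Only the three emoticons quoted in the text are listed; the full Wikipedia-derived list used in the study is not reproduced, and plain substring matching is an assumption that can misfire on strings such as "ID:D4".

```python
# Only the three emoticons quoted in the text; the study used a longer list.
POSITIVE_EMOTICONS = (":-)", ":)", ":D")


def has_positive_emoticon(tweet):
    """True if any listed emoticon occurs anywhere in the tweet."""
    return any(emoticon in tweet for emoticon in POSITIVE_EMOTICONS)


def emoticon_filter(tweets):
    """Drop tweets that contain a positive emoticon."""
    return [t for t in tweets if not has_positive_emoticon(t)]


examples = [
    "Glad to hear that you're beating the flu. :-) Hope you don't get the nasty cough",
    "bad cough bad flu",
]
kept = emoticon_filter(examples)
```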

Humor features indicate whether the content of tweets is 
a joke, irony or humor and could be a strong clue that the 
tweet is not relevant to ILI. We identified humor features with 
keywords such as "haha", "hihi" or "***cough ... cough***". 
We observed that many tweets containing the phrase "***cough ...
cough***" are jokes, for example, "Hm Im kinda wanting to
go to NYC really soon ***cough ... cough*** @Ctmomofsix 
=)". We called this filtering method Humor. 
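
The Humor filter can be illustrated with a single regular expression over the three cues named in the text. The pattern below is a sketch: tolerating a variable number of asterisks and optional spaces around the dots is an assumption, and the study may have used additional cues.

```python
import re

# Cues from the text: "haha", "hihi", and "***cough ... cough***".
HUMOR_PATTERN = re.compile(
    r"haha|hihi|\*+\s*cough\s*\.{3}\s*cough\s*\*+",
    re.IGNORECASE,
)


def is_humorous(tweet):
    """True if the tweet contains one of the humor cues."""
    return HUMOR_PATTERN.search(tweet) is not None


def humor_filter(tweets):
    """Drop tweets flagged as jokes."""
    return [t for t in tweets if not is_humorous(t)]


examples = [
    "Hm Im kinda wanting to go to NYC really soon ***cough ... cough*** @Ctmomofsix =)",
    "bad cough bad flu",
]
kept = humor_filter(examples)
```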

Geography is a feature associated with Twitter messages
that could be particularly useful in surveillance systems.
Such information is not located within the tweet's message
line, but is stored optionally in the user's profile data. Since
each user is free to enter any information, the geographical
information can be a country, city or town name such as
"NY, USA", "LA", or latitude/longitude from mobile devices
such as "\u00dcT: -7.272681,112.755908", or even nonsense
information. Since CDC data cover only the USA, we needed
to keep only Twitter messages from this country. To do this, we
sent queries about Twitter users' location information to Google Maps
(http://maps.google.com) and obtained the returned results in JSON
format. We wrote a simple parser in Python to parse these
returned results to get information about the country. We
selected only tweets that came from the USA and designated
this filtering method as Geo.
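
The country lookup can be illustrated with a small parser over a geocoding response. The sketch below assumes the JSON layout of the present-day Google Geocoding API ("results", "address_components", "types", "short_name"); the format returned to the authors may have differed, the sample response is hand-written for illustration, and no network request is made.

```python
import json


def country_code(geocode_json):
    """Return the country short name from a geocoding response, or None if absent."""
    data = json.loads(geocode_json)
    for result in data.get("results", []):
        for component in result.get("address_components", []):
            if "country" in component.get("types", []):
                return component.get("short_name")
    return None


def is_us_location(geocode_json):
    """True if the geocoded profile location resolves to the USA."""
    return country_code(geocode_json) == "US"


# Hand-written sample in the assumed response layout (not real API output).
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "address_components": [
        {"long_name": "New York", "short_name": "NY",
         "types": ["administrative_area_level_1", "political"]},
        {"long_name": "United States", "short_name": "US",
         "types": ["country", "political"]}
      ]
    }
  ],
  "status": "OK"
}
"""
EMPTY_RESPONSE = '{"results": [], "status": "ZERO_RESULTS"}'
```

Profile strings that cannot be geocoded, such as nonsense locations, simply yield no country and are dropped by the filter.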

III. Evaluation 

A. Evaluation measures 

We used the Pearson product-moment correlation coefficient 
to measure the correlation between fluctuation in the number 
of tweets according to each of the filtering methods described 
above and the CDC data on ILI rates. Although not very 
sophisticated, this measure was used in previous work to
compare standard time series data [5],
so we adopted it for easy comparison of our methods with
previously published algorithms. As a reminder to the readers,
the Pearson correlation coefficient between two variables X 
and Y is calculated as follows.

\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]

where X_i, Y_i are samples and \bar{X}, \bar{Y} are the sample means of
variables X and Y, respectively. The value of r represents
the linear relationship between the variables and ranges from
-1 to +1. It is +1 or -1 in the case two variables are perfectly
correlated or anti-correlated, respectively. If it is 0, there is no
linear correlation between the variables.
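
For readers who prefer code, the Pearson coefficient described above can be computed directly from its definition. The sketch uses only the Python standard library; the study itself used R, so this is an illustration rather than the original analysis code.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


perfect = pearson([1, 2, 3], [2, 4, 6])
inverse = pearson([1, 2, 3], [3, 2, 1])
```

In the setting of this paper, `xs` would hold the 36 weekly ILI rates and `ys` the 36 normalized weekly tweet counts.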

In our experiments, X designates the ILI rate from CDC
data, and Y is a normalized value representing the frequency
of filtered Twitter messages. Since CDC data are reported
weekly (starting on Sunday and ending on Saturday), we
bucketed Twitter messages by week. X_i and Y_i were the
ILI rate and the normalized number of filtered
Twitter messages for week number i (i = 1, ..., 36),
respectively. Statistical analyses were implemented using R
packages (http://www.r-project.org/).
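
The weekly bucketing and normalization can be sketched as follows. Week 1 is taken to start on Sunday 30 August 2009, as stated in the text; dividing the filtered count by the total number of tweets in the same week follows the Fig. 4 caption. The function names are hypothetical, and any constant scaling factor is omitted because it does not affect the correlation.

```python
from collections import Counter
from datetime import date

SEASON_START = date(2009, 8, 30)  # Sunday, first day of week 1
N_WEEKS = 36


def week_number(day):
    """1-based Sunday-to-Saturday week index, or None outside the 36-week window."""
    offset = (day - SEASON_START).days
    if offset < 0 or offset >= 7 * N_WEEKS:
        return None
    return offset // 7 + 1


def weekly_fraction(filtered_days, all_days):
    """Per-week share of filtered tweets among all tweets (0.0 for empty weeks)."""
    filtered = Counter(week_number(d) for d in filtered_days)
    total = Counter(week_number(d) for d in all_days)
    return [
        filtered[w] / total[w] if total[w] else 0.0
        for w in range(1, N_WEEKS + 1)
    ]


all_days = [date(2009, 8, 30), date(2009, 9, 1), date(2009, 9, 6)]
filtered_days = [date(2009, 8, 30)]
fractions = weekly_fraction(filtered_days, all_days)
```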

B. Results 

The results of the different keyword-based filtering methods are
shown in Table IV, with all p-values <2.2e-16. Among the three
empirical keyword filtering methods, Culotta4 had the highest correlation
of 94.85%, which was slightly, but not statistically significantly, higher
than Signorini4 (94.73%) and Chew3 (94.48%). Culotta4
kept the highest number of tweets (more than 1.8
million tweets), while Chew3 kept just over 300 thousand
tweets. By contrast, another source,
Google Flu Trends, which used search query logs, achieved
a correlation coefficient of 99.12%.

Table IV shows that the correlation coefficient of Culotta4
was 94.85%, significantly higher than Syndrome at
88.60% (p=0.03). However, when adding the keyword "flu"
to Syndrome, the coefficient increased to 97.13%, which
was significantly higher than Culotta4 (p=0.0006). This indicated
that combining "flu" with the keywords derived from the
ontology yielded better results than using
syndrome keywords only. The keyword-based methods also
showed that removing URLs from tweets helped significantly
improve correlations to 97.52% with Syndrome + "flu" -
URL (p=0.0002). Interestingly, these results supported the
results of Cook et al. [21], which showed that by adding
more "influenza symptom" keywords to search query logs, the
Google Flu Trends models improved. This implies that not only
the term "flu", but also symptom keywords are probably important
for calculating the ILI rate in both search query logs and Twitter
data.

Adding the keyword "flu" to the filtering method increased the
correlation significantly (nearly 10%, p=0.0002),
but the number of tweets decreased dramatically. On the other
hand, combining extra terms with syndrome terms as in the
Syndromes + Extra terms method increased the correlation
from 88.60% to 95.78% (p=0.007), but still kept the number of
tweets high (more than 880,000 tweets). Compared to Culotta4,
Syndromes + Extra terms had a higher correlation, but the difference was not
significant (p=0.29).

Results showing the effectiveness of each semantic feature
and their combinations are shown in Table V (all p-values
<2.2e-16). We designate a combination by adding the symbol
"+" between features. For example, Negation + Emoticon
means the combination of the filtering methods Negation
and Emoticon. Among the four semantic features, even negation,
which ranked highest, improved Syndrome
+ "flu" - URL only from 97.52% to 97.65%. The
remaining features showed at most a slightly positive effect, and the
improvement was also not statistically significant. Table V shows that
the effectiveness of semantic features on the ILI rate could be
ranked in the following order: Emoticon <= Hashtags <=
Humor <= Negation.

Geographical features, shown in Table V and Table VI,
helped increase correlation coefficients from 97.52% to
98.39% for Syndrome + "flu" - URL and from 95.78% to 96.03%
for Syndromes + Extra terms, but the increases were not statistically significant
(p=0.18 and 0.45, respectively). We also applied the other semantic
filters to Syndromes + Extra terms, but we could not
see any improvement, except for Geo. This suggests that geographical features
might be more important than semantic features. Note that
all p-values for differences take into account the number of
tweets in each set.

When combining all semantic features, the best correlation
coefficient was 98.46% (Negation + Humor + Emoticon +
HashTags + Geo), an improvement of 3.98% compared to
Culotta4.

For comparison, normalized Twitter data of
Syndrome + "flu", Best Combination (i.e., Negation + Humor
+ Emoticons + HashTags + Geo), Baseline (Culotta4), GFT
and CDC data are shown in Figure 4.

IV. Discussion 



We observed from Table IV that the knowledge-based
approach outperformed the empirical approach (97.52% correlation
coefficient for Syndrome + "flu" - URL, compared to
94.85% for Culotta4, p=0.06). As discussed in the Introduction,
choosing keywords for filtering is very important and
requires complex modeling to construct appropriate lists. The
main difference between the knowledge-based approach and

Fig. 4. Normalized Twitter data of Syndrome + "flu", the Best Combination
(Negation + Humor + Emoticons + HashTags + Geo), Baseline (Culotta4),
Google Flu Trends (GFT) (for reference only, as it uses a different type of
data), and CDC data (gold standard). The Y axis represents the ILI rate value (%);
values for GFT come from Google Flu Trends, and CDC data come from the CDC's
U.S. Outpatient Influenza-like Illness Surveillance Network. Other values are
normalized by the number of tweets per week for each filtering
method, i.e., K x (number of tweets in the filtering method)/(total number of
tweets that week), where K is a power of 10 chosen separately for Culotta4,
Syndrome + "flu", and Best Combination.



Culotta4 is that the knowledge-based approach first filtered 
tweets based on syndrome keywords and then filtered influenza 
from syndrome cases, while Culotta4 simply filtered using any 
of four keywords, i.e., "flu", "headache", "sore throat", and
"cough". As a result, our method obtains tweets
that contain information on the combination of syndrome and
influenza, thus yielding higher correlation with CDC rates. For
example, Syndrome + "flu" - URL can contain tweets such as
"Down with a flu. Sore throat. Its burning like hale", which
requires a respiratory syndrome keyword in the BCO
("sore throat") first and then the keyword "flu". Culotta4 had
false positives in messages such as "Guh, lack of sleep and
a whole lot of algebra homework makes for one hell of a 
headache" or "Has a wicked headache, as well as a new bill 
to pay" which have no relation to influenza. 



Table IV also shows that individual reports had a positive
effect on the ILI rate. Individual or personal reports, i.e., those
where the subject of the message was the author, achieved a
correlation coefficient of 97.52% (Syndrome + "flu" - URL), compared
to 97.12% for comments on external events
(p=0.38). We believe that this may have happened
because the ILI rate is reflected in reports of individual and
specific cases, but may not be reflected in general cases or
undetermined groups. Therefore, individual reports seem more
helpful for estimating the ILI rate. For example, a tweet "bad
cough bad flu ", which implicitly indicates that the user has an
ILI, would be more useful than an event-based tweet such as
"7-year-old boy dies of flu, pneumonia <URL>".

Table V and Table VI show that the geographic features
improved performance more than the
semantic features did. In Culotta4, it is implicitly assumed that



TABLE IV

Results using keyword-based methods in Twitter messages (results for
Google Flu Trends (GFT) are shown for reference, since GFT uses search
query data, not tweets) (all p-values <2.2e-16). Bolded text shows the
best correlation coefficient scores.

Approach          Keyword-based filtering          #tweets      Pearson correlation coefficient (%)
Empirical         Culotta4 (baseline)              1,812,682    94.85
approach          Signorini4                       1,294,459    94.73
                  Chew3                            307,884      94.48
                  Google Flu Trends                NA           99.12
Knowledge-based   Syndromes only                   386,199      88.60
approach          Syndromes + "flu"                9,034        97.13
                  Syndromes + "flu" - URL          8,485        97.52
                  Syndromes + Extra terms          887,718      95.78
                  Syndromes + Extra terms - URL    668,611      94.43


TABLE V 

Results using semantic-based methods with individual reports in Twitter data (all p-value<2.2e-16). Bolded texts show the best correlation coefficient score. 

Semantic-based filtering                            #tweets    Pearson correlation coefficient (%) 
Individual reports (Syndromes + "flu" - URL)          8,485    97.52 
Negation                                              7,359    97.65 
Emoticon                                              8,285    97.52 
HashTags                                              8,315    97.61 
Humor                                                 8,388    97.65 
Geo                                                   2,214    98.39 
Negation + Emoticon                                   7,192    97.62 
Negation + HashTags                                   7,218    97.70 
Negation + Humor                                      7,268    97.74 
Negation + Emoticon + HashTags + Humor                6,978    97.76 
Negation + Emoticon + HashTags + Humor + Geo          2,180    98.46 


TABLE VI 

Results using Geo filtering with Syndromes + Extra terms (all p-value<2.2e-16). We only see an improvement by filtering by geography (Geo), but not for other features (Negation, Emoticon, HashTags, Humor). 

Semantic-based filtering      #tweets    Pearson correlation coefficient (%) 
Syndromes + Extra terms       887,718    95.78 
Geo                           213,989    96.03 



the ILI rate in the US can be calculated based on Twitter 
messages from worldwide Twitter data. However, in reality, 
the ILI rate varies across administrative regions, including 
nation states. Moreover, the number of Twitter users has 
grown rapidly, by about 300,000 new users per day according 
to statistics for April 2010 p2| . Geography was an 
important feature for tracking the ILI rate in our data set. 
In retrospect, we could explain why features such as negation 
did not help, and could even have harmed the correlation: even 
if individuals did not have ILI, they were aware of it and 
could be commenting on close friends or family, who might not 
be tweeting about their own ailment. The same applies to 
Humor, Emoticons (which could represent sympathy messages), 
and HashTags. 
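
For readers who want to reproduce the kind of figure reported in Tables IV to VI, the following sketch computes a Pearson correlation coefficient between two weekly series. It assumes the Twitter signal is a weekly count of retained tweets, which is an assumption made here for illustration; the numbers are invented and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Invented numbers, for illustration only (not taken from the paper).
weekly_tweet_counts = [120, 340, 560, 410, 220, 90]
weekly_ili_rate = [1.8, 3.9, 6.1, 4.6, 2.7, 1.5]

r = pearson(weekly_tweet_counts, weekly_ili_rate)

if __name__ == "__main__":
    print(f"Pearson correlation: {100 * r:.2f}%")
```

In this setting, a geographic filter changes only which tweets contribute to the weekly counts; the correlation computation itself is unchanged.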



Table VII shows types of tweets retained after semantic-based 
filtering with individual reports. We classified the retained 
tweets into five types: influenza confirmation, influenza 
symptoms, flu shots, self protection, and medication. Among 
them, influenza confirmation and influenza symptoms were the 
most frequent. 

There were several difficult cases that our method should 
filter out but does not. Examples include tweets expressing 
flu-related opinions, such as "Just boarded my flight-hearing 
coughs, sneezes, sniffles. Gross. Maybe this us how I got a 
cold. Cold/flu season is here.". Difficult cases also include 
tweets about flu vaccination and other prevention measures, 
e.g., "is now properly vaccinated against the flu and 
pneumonia... oh and has a TB skin test..hmmm it is turning 
more red than usual", or information about flu symptoms such 
as "flu symptoms seem to have a cascade order - starts with 
1.sore throat then 2. fever 3.aches&pains 4.coughing". Such 
messages are still not filtered out by our method. These cases 
may need to be handled by more advanced natural language 
processing methods. 

We compared our results to Google Flu Trends (GFT) 
over the same period. The correlation coefficient of GFT to 



TABLE VII 

Types of tweets that are kept after semantic-based filtering of individual reports. 

Types                    Tweet samples 

Influenza confirmation   I got flu n coughed a lot. Now my voice is like monster's voice. Rrr 

                         @lisarob54 thanks lisa. He's got flu, bad head, aching all over, rotten 
                         cough. Hasn't started making piggy noises yet tho!!! 

                         Barber just coughed on me in the chair. Pretty sure I now have swine flu 

Influenza symptoms       Are a sore throat and backache signs out the swine flu? Do I need to call 
                         the death squad? 

                         My day: flu-like symptoms (headache, body aches, cough, chills, 100.9 fever). 
                         Swine flu not ruled out. #H1N1 

                         Wish I could stay & chat (or write a blog post) buti think I have the flu: 
                         aches, chills, fever, sore throat - it ain't purty. off to bed. 

Flu shots                I'm still getting flu shots, nothing is worth flu turning into bronchitis 
                         into pneumonia 

                         I got a flu shot was the sickest I ever was . . .ended up with bronchitis and 
                         worse #ecowed 

                         @pauloflaherty I'm home with bronchitis. Doc did strep test and flu test as 
                         well. The flu test was torture! 

Self protection          Cover your mouth if coughing, use a tissue, wash your hands often & get a 
                         flu shot - protect and defend your community from #H1N1 

                         5 tips 2 keep you safe from the flu; When sick, stay home. Get flu shot. Hand 
                         wash/sanitize. Cover cough/sneeze. Healthy eating & exercise. 

                         hands are dry so applied lotion, lotion is scented so nose runs, kleenex 
                         to runny nose, wash hands be of swine flu then have dry hands again 

Medication               Wondering why I didn't take the flu shot, laying in bed with cough 
                         drops, medicine, and the remote 

                         ... Cough medicine then sleep ... God protect me from the pig flu ... Plz and 
                         thnx ... K g' night twatters 



CDC data was 99.12%, higher than our best result, 98.46% 
(Negation + Emoticon + HashTags + Humor + Geo), but the 
difference was not statistically significant (p=0.13). We 
think one of the reasons for the difference between GFT and 
our method relates to the quality of the data. GFT can access 
the whole query log of users of the Google search engine, 
while our data are a subset of the total Twitter data, with an 
unknown sampling rate. As we do not know the sampling rate 
precisely, we speculate that if the sampling rate were higher, 
the correlation coefficient could get closer to that of GFT. 
Furthermore, GFT can accurately identify geographical location 
by looking up the static IP address of machines, whereas 
Twitter profiles do not reveal IP addresses, only free text 
submitted by the users. However, Twitter provides API 
functions at https://dev.twitter.com that can classify tweets 
by geolocation through latitude/longitude or city/province 
names. For example, Flu Detector retrieved tweets in this way, 
limiting them to 49 urban centres in the UK ||5). Moreover, 
Twitter data are public and freely accessible through the 
Twitter API, while query data from Google users are closed 
and typically cannot be accessed by the research community. 
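
The text does not state which test produced the reported p-values. One candidate, offered here purely as an assumption, is a one-sided Fisher z-test that treats the two correlation coefficients as independent and takes n = 36 weekly observations for each series. The function name below is hypothetical.

```python
import math


def fisher_z_one_sided_p(r_high, r_low, n_high, n_low):
    """One-sided p-value for H1: rho_high > rho_low (independent samples).

    Assumption: this is a standard textbook test, not necessarily the one
    used in the study.
    """
    # Fisher transformation: z = atanh(r) is approximately normal
    # with variance 1 / (n - 3).
    z_high = math.atanh(r_high)
    z_low = math.atanh(r_low)
    standard_error = math.sqrt(1.0 / (n_high - 3) + 1.0 / (n_low - 3))
    z_stat = (z_high - z_low) / standard_error
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z_stat / math.sqrt(2.0))


# GFT (99.12%) versus the best Twitter result (98.46%), 36 weeks each.
p_gft = fisher_z_one_sided_p(0.9912, 0.9846, 36, 36)

if __name__ == "__main__":
    print(f"p = {p_gft:.2f}")
```

Under these assumptions the function gives roughly 0.13 for the GFT comparison, and roughly 0.38 for 97.52% versus 97.12%, in line with the values reported in the text. Because both correlations are computed against the same CDC series, a test for dependent correlations would be the more rigorous choice.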



There are still several remaining avenues to improve the 
methods presented here. The first is establishing the scope of 
negation, which plays an important role in sentiment analysis 
p5| , especially when there are more than two negations within 
a tweet. For example, "yes i would really like to stop coughing, 
no i do not have swine flu". The second is determining 
whether the tweet is factive or modal, reflecting user levels 
of belief in a particular condition, e.g., "got cold but not 
sure I got flu". The third avenue is to develop an explicit 
linkage with sentiment analysis and opinion mining [33], for 
example, determining the relation between user responses to 
influenza news happening elsewhere and the ILI rate. 
Additionally, deeper analyses of demographics such as age and 
gender, as well as user preferences, which are available in 
Twitter user profiles, could reveal important clusters. Such 
data are not available in Google Flu Trends, and could be an 
important enhancement for flu risk analysis. 
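
To make the negation-scope problem concrete, the sketch below uses a naive fixed-window heuristic: a target word counts as negated if a negation cue appears within a few tokens before it. This is an illustrative assumption, not the negation filter used in the study; the cue list, the function names and the default window size are all hypothetical.

```python
import re

# Assumption: a small illustrative cue list, not the lexicon used in the study.
NEGATION_CUES = {"no", "not", "never", "without", "don't", "dont"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_negated(text, target, window=3):
    """Return True if a negation cue occurs within `window` tokens before
    any occurrence of `target`."""
    tokens = tokenize(text)
    for index, token in enumerate(tokens):
        if token == target:
            preceding = tokens[max(0, index - window):index]
            if any(cue in NEGATION_CUES for cue in preceding):
                return True
    return False


if __name__ == "__main__":
    tweet = "yes i would really like to stop coughing, no i do not have swine flu"
    print(is_negated(tweet, "flu"))       # "not" is within 3 tokens of "flu"
    print(is_negated(tweet, "coughing"))  # no cue in the 3 tokens before it
```

With this heuristic, "flu" in the first example tweet is treated as negated while "coughing" is not. The modal example "got cold but not sure I got flu" is flagged only when the window is widened to four tokens, which shows how sensitive a fixed window is and why a proper treatment of scope is needed.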

V. Conclusion 

This study has shown that Twitter messages can be used 
to track the ILI rate with a high degree of correlation to 
official government data. By analyzing Twitter messages, we 
showed that selecting keywords with a knowledge-based 
approach is beneficial. Furthermore, semantic feature 
filtering was shown to be useful for selecting tweets 
based on geography. The approach is systematic and general, 
so it may be applicable to a wide range of diseases and 
syndromes. Further investigation in this direction is warranted. 
Other fruitful areas for future study include detection of 
predictive signals, and integration of data signals from social 
and news media. We implemented an experimental system 
called Dizie (Disease Information Extraction), available at 
http://born.nii.ac.jp/dizieproj/_dev/, to track ILI and five 
other syndromes using the methods and insights from this study. 
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